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Abstract 

Motivation: Computational RNA structure prediction is a mature important problem which 
has received a new wave of attention with the discovery of regulatory non-coding RNAs and the 
advent of high-throughput transcriptome sequencing. Despite nearly two scores of research on 
RNA secondary structure and RNA-RNA interaction prediction, the accuracy of the state-of- 
the-art algorithms are still far from satisfactory. So far, researchers have proposed increasingly 
complex energy models and improved parameter estimation methods, experimental and/or com- 
putational, in anticipation of endowing their methods with enough power to solve the problem. 
The output has disappointingly been only modest improvements, not matching the expectations. 
Even recent massively featured machine learning approaches were not able to break the barrier. 
Approach: The first step towards high accuracy structure prediction is to pick an energy model 
that is inherently capable of predicting each and every one of known structures to date. In this 
paper, we introduce the notion of learnability of the parameters of an energy model as a measure 
of such an inherent capability. We say that the parameters of an energy model are learnable iff 
there exists at least one set of such parameters that renders every known RNA structure to date 
the minimum free energy structure. We derive a necessary condition for the learnability and give 
a dynamic programming algorithm to assess it. Our algorithm computes the convex hull of the 
feature vectors of all feasible structures in the ensemble of a given input sequence. Interestingly, 
that convex hull coincides with the Newton polytope of the partition function as a polynomial in 
energy parameters. To the best of our knowledge, this is the first approach towards computing 
the RNA Newton polytope and a systematic assessment of the inherent capabilities of an energy 
model. 

Results: Wc demonstrated the application of our theory to a simple energy model consisting 
of a weighted count of A-U and C-G base pairs. Our results show that this simple energy 
model satisfies the necessary condition for less than one third of the input unpseudoknotted 
sequence-structure pairs chosen from the RNA STRAND v2.0 database. For another one third, 
the necessary condition is barely violated, which suggests that augmenting this simple energy 
model with more features such as the Turner loops may solve the problem. The necessary 
condition is severely violated for 8%, which provides a small set of hard cases that require 
further investigation. 



*to whom correspondence should be addressed 
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1 Introduction 



Computational RNA structure and RNA-RNA interaction prediction have always been important 
problems, particularly now that RNA has been shown to have key regulatory roles in the cell 
[H El El HI [SI El [7]. Furthermore, with the advent of synthetic biology at the whole organism 
level [8], high-throughput accurate RNA engineering algorithms are required for both in vivo and 
in vitro applications [U [101 El 021 CE] . Since the dawn of RNA secondary structure prediction 
nearly two scores ago p3], the research community has proposed increasingly complex models and 
algorithms, hoping that refined features together with better methods to estimate their parameters 
would solve the problem. Early approaches considered mere base pair counting, followed by the 
Turner thermodynamics model which was a significant leap forward. Recently, massively feature- 
rich models empowered by parameter estimation algorithms have been proposed, but they provide 
only modest improvements. 

Despite significant progress in the last three decades, made possible by the work of Turner 
and others [15] on measuring RNA thermodynamic energy parameters and the work of several 
groups on novel algorithms [HI H3 HHJ HHJ [20], Ell E21 E31 El] and machine learning approaches 
[25lE6lE7j, the RNA structure prediction accuracy has not reached a satisfactory level yet. Why is 
it so? Up to now, human intuition and computational convenience have lead the way. We believe 
that human intuition has to be equipped with systematic methods to assess the suitability of a 
given energy model. Surprisingly, there is not a single method to assess whether the parameters 
of an energy model are learnable. We say that the parameters of an energy model are learnable 
iff there exists at least one set of such parameters that renders every known RNA structure to 
date, determined through X-ray or NMR, the minimum free energy structure. Equivalently, we 
say that the parameters of an energy model are learnable iff 100% structure prediction accuracy 
can be achieved when the training and test sets are identical. The first step towards high accuracy 
structure prediction is to make sure that the energy model is inherently capable, i.e. its parameters 
are learnable. In this work, we provide a necessary condition for the learnability and an algorithm to 
verify it. To the best of our knowledge, this is the first approach towards a systematic assessment of 
the suitability of an energy model. Note that a successful RNA folding algorithm needs to have the 
generalization power to predict unseen structures as well. We do not deal with the generalization 
power in this work and leave it for future work. 

2 Background 

2.1 RNA Secondary Structure Models 

An RNA secondary structure model is often a context free grammar together with a scoring function 
for either the rules, in the case of stochastic context free grammars (SCFG) [28], or the alphabet, 
in the case of thermodynamics models [H]. Such scoring functions induce scoring on the entire 
generated language. The word with optimal score then yields a predicted structure for the given 
sequence. For the sake of brevity, we focus on thermodynamics models in this paper, but it is 
obvious that our methods apply to other models including SCFG as well. In our context, the 
scoring function is the thermodynamics free energy. A secondary structure y of a nucleic acid is 
decomposed into loops; a free energy is associated with every loop in y; and the total free energy 
G for y is the sum of loop free energies [15]. The same loop decomposition principle applies to 
interacting nucleic acids such that the total free energy G is still the sum of the free energies of 
loops and interaction components [22j . 
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2.2 Estimation of Energy Parameters 

Existing machine learning algorithms for parameter estimation in RNA structure prediction can be 
grouped into two categories: 

• Likelihood-based methods, where the maximum likelihood (ML) principle is used to estimate 
the parameters of a probabilistic model, e.g. [25], and 

• Large-margin methods, where the model parameters are estimated to maximize the margin 
between the score of the true structure and the second best structure. This has been done 
using an online passive-aggressive training algorithm [27] and Iterative Constraint Generation 
(CG) 129]. 

The likelihood-based techniques estimate the best Gibbs distribution, which not only assists in 
predicting the best secondary structure but also is utilized in determining the thermodynamic 
parameters. The most successful method for learning the thermodynamics of RNA has been the 
maximum likelihood method, as in CONTRAfold [25] . which maximizes the probability of RNA 
structures y given RNA sequences x for the training set D. That is, the conditional log likelihood 
of the training data (using the Boltzmann distribution) is maximized to estimate the best model 
parameters h* E M. k : 

h* := argmaxL(Z); h) = max > logp(y\x, h), (1) 

h h ' — ' 

(x,y)eD 

e -G(x,y,h)/RT 

^' X ' h):= Q(x,h) ' (2) 

where k denotes the number of different motifs defined in the energy model, R is the gas constant, 
T is the absolute temperature, G(x,y, h) is the free energy, and 

Q(x,h) := £ e -G(*,s,h)/RT (3) 

is the partition function \22\ [20"1 [2~T] with £(x) being the ensemble of possible structures of x. The 
free energy 

G(x,s,h):=(c(x,s),h) (4) 
is a linear function of the parameters h where c(x, s) € Z fc is the features vector. 



3 Learnability 

The question that we ask before parameter estimation is: does there ever exist parameters such 
that for every (x, y) E D, y = arg min s G(x, s, h))? If the answer to this question is no, then there 
is no hope that one can ever achieve 100% accuracy using the given model. The answer reveals 
inherent limitations of the model, which can be used to design improved models. We provide 
a necessary condition for the existence of and a dynamic programming algorithm to verify it 
through computing the Newton polytope for every x in D. We will define the RNA Newton polytope 
below. Not only our algorithm provides a binary answer, it also quantifies the distance from the 
boundary. 
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4 Methods 



4.1 Necessary Condition for Learnability 

Let (x,y) € D and ht € M. k . Assume y minimizes G(x, s,h') as a function of s. In that case 

G(x, y, h f ) < G(x, s, ht), V s e £ (x). (5) 

Replacing (jlj) above, 

(c(x,y),ht) < (c(x,a),ht), Vs£^(i) (6) 
0< (c(x,s) -c(x,y),h)), Vse£(x). (7) 

Define the feature ensemble of sequence x by 

J-(x) := {c(x, s)\se £(x)} C Z fc . (8) 

In that case, ([7]) implies that 

0< (T(x)-c(x,y),h^. (9) 

We call the convex hull of J~(x) the Newton polytope of x, 

J\f(x) := conv {^(x)} C R k . (10) 

We remind the reader that the convex hull of a set, denoted by 'conv' hereby, is the minimal convex 
set that fully contains the set. The reason for naming this polytope the Newton polytope will be 
made clear below. Inequality ([9]) implies that c(x, y) £ dJ\f(x) is on the boundary of the convex 
hull of the feature ensemble of x with a support hyperplane normal to h^. Therefore, we have the 
following theorem. 

Theorem 1. Let (x,y) € D and 7^ h^ £ ]R fc . Assume y minimizes G(x, s,h^) as a function of 
s. In that case, c(x,y) E dAf(x), i.e. the feature vector of (x,y) is on the boundary of the Newton 
polytope of x. 

Proof. To the contrary, suppose c(x,y) is in the interior of Af(x). Therefore, there is an open ball 
of radius 5 > centered at c(x,y) completely contained in Af(x), i.e. 

B s (c(x,y)) cM(x). (11) 

Let 

ht 

V = c{x,y) - (5/2) 



„htir 

It is clear that p G B$(c(x,y)) C Af(x) since \\p — c(x,y)\\ = 5/2 < 5. Therefore, p can be written 
as a convex linear combination of the feature vectors in F(x) = {v±, . . . , vn}, i-e. 

3 ol\, ■ ■ ■ aw > : a\v\ + • • • + a^v^ = p, (12) 

Ctl + • • • + OiN = 1. (13) 

Note that 

[p-c(x,j/),ht\ = -(<y/2)||ht||<0. (14) 
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Therefore, there is 1 < i < N, such that (vi — c(x, y), hJ} < for otherwise, 

N 

(p- c(x,y),ht\ = ^a.j (vi - c(x,y),h^ > 0, (15) 
t=i 

which would be a contradiction with (|14p . It is now sufficient to note that Vi € J~(x) and 
(vi — c(x, y), h> ) < which is a contradiction with ([9]). □ □ 

Corollary 1 (Necessary Condition for the Learnability) . For (x,y) E D, a necessary but not 
sufficient condition for the existence ofh) such that y minimizes G(x, s,lv) as a function of s is 
that c(x,y) lies on the boundary of M{x) the Newton polytope of x. 

4.2 Relation to the Newton Polytope 

In addition to D the set of experimentally determined structures, we often have a repository 
of thermodynamic measurements, e.g. melting curves, that can help better estimate the energy 
parameters. Such measurements often relate to the energy parameters through equations involving 
the partition function and its derivatives with respect to temperature [22]. We show that with 
a change of variables, the partition function becomes a polynomial. Therefore, such equations 
become a system of polynomial equations the solving of which algebraically requires computation 
of the Newton polytope of each polynomial [30(, 13 lj . Recall the partition function defined in ([3]) 
and energy in 0, and conclude 

Q(x,h)= Y, e~^ x ^l RT . (16) 
se£(x) 

Let c(x, s) = (c\(x, s), . . . , Ck(x, s)) and h = (hi, . . . , h^). Define new variables 

Zi := e - h */ RT , 1 < i < k, (17) 
and replace them in (|16j) . We obtain the partition function 

Q(x,Z)= ]T (18) 

se£{x) 

in the form of a polynomial in M[Z] where 

Z c(x,s) .-J[Z«(*>») (19) 

1=1 

is a monomial as < Ci(x, s) € Z. The Newton polytope of Q is defined to be the convex hull of 
the monomials power vectors, i.e. 

Newton {Q(x, Z)} := conv ({c(x, s) | s £ £(x)}) = M(x). (20) 

That is why we call Af(x) the Newton polytope of x. 
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4.3 RNA Newton Polytope Algorithm 

We give a dynamic programming algorithm to compute the Newton polytope for a given nucleic 
acid sequence x. Denote the length of x by L and the i th nucleotide in x by rij. Denote the 
subsequence of x from the i th to the j th nucleotide, inclusive of ends, by • • • rij. The following 
lemma allows us to formulate a divide-and-conquer strategy for computing the Newton polytope, 
which will in turn lead to our dynamic programming algorithm. 

Lemma 1. Let f and g be two polynomials in R[Z]. The Newton polytope of the product of f and 
g is the Minkowski sum of individual Newton polytopes, and the Newton polytope of the sum of f 
and g is the convex hull of the union of individual Newton polytopes, i.e. 

Newton (fg) = Newton (/) © Newton (g) , (21) 
Newton(f + g) = conv {Newton (f) U Newton(g)} , (22) 

in which represents the Minkowski sum of two polytopes I30f . 

This lemma allows us to use the same divide-and-conquer strategy that was used for calculating 
the partition function \22\ [20l [2T] . We can use the same recursions (grammar) as in the partition 
function algorithm but with the Minkowski sum © instead of multiplication, convex hull of union 
instead of summation, and the corresponding feature vector c instead of e~( c ' h ^ RT . Furthermore, 
since union is invariant with respect to repetition of points, the dynamic programming is allowed to 
be redundant, or equivalently the grammar is allowed to be ambiguous. Hence, any complete RNA 
structure or RNA-RNA interaction prediction dynamic programming algorithm can be transformed 
into a Newton polytope algorithm by replacing the energy with the corresponding feature vector, 
summation with the Minkowski sum ©, and minimization with the convex hull of union. 

As explained above, we transform any complete partition function or structure 
prediction dynamic programming algorithm, for single RNA, RNA-RNA interaction, 
or multiple interacting RNAs, into a Newton polytope algorithm. For the sake of 
illustration, we explicitly spell below only the case of single RNA with separate A- 
U and C-G base pair counting energy model. All the other cases are quite trivially 
obtained following the transformations above. 

In this case, the feature vector c(x,s) = (ci(x, s), c%(x, s)) is two dimensional: ci(x,s) is the 
number of A-U, and C2(x,s) the number of C-G base pairs in s. Our dynamic programming 
algorithm starts by computing the Newton polytope for all unit length subsequences, followed by 
all length two subsequences,. . ., up to the Newton polytope for the entire sequence x. We denote 
the Newton polytope of the subsequence ni ■ ■ ■ nj by Af(i,j), i.e. 

:=N{n i ---n j ). (23) 

The following dynamic programming will yield the result 



M(i,j) = conv 



Af(i,e)®Af(£+l,j), i<£<j-l 
U{ {(1,0)} ffi N{i + 1, j -1) if Turij = AU|UA 
{(0,l)}©A/"(i + l,j-l) iiniUj = CG|GC 



(24) 



with the base case Af(i,i) = {(0,0)}. 

There are two different approaches for polytope representation: (i) vertex representation, which 
is a set of points, and (ii) half plane representation, which is a set of linear inequalities. The former 
is often called V-representation and the latter "H-representation. Although they are equivalent, 
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and there are algorithms to transform one into the other, computing Minkowski sum is more 
convenient with the V-representation, and convex hull of union works more efficiently with the 
■H-representation. The choice of representation and algorithms will affect the running time. In this 
paper, we use the V-representation. 

4.4 Verification of the Necessary Condition 

Upon computation of Af(x) and c(x,y), the feature vector of the experimentally determined struc- 
ture, it remains to verify whether c(x,y) G dAf(x). Often, M(x) is represented by its vertices (V- 
representation) or its confining half planes (^-representation) , two equivalent representations that 
can be transformed into one another. In an ^-representation, c(x, y) is on the boundary of Af(x) iff 
there is at least one confining plane on which c(x, y) lies. This is true because c(x, y) € ftf(x) any- 
ways. Therefore, the necessary condition can be easily checked by checking membership of c(x, y) 
in every confining plane. Since the vertices of M{x) are on the integer lattice, all calculations are 
rational and hence can be performed exactly. 

4.5 Dataset 

We used 1720 unpseudoknotted RNA sequence-structure pairs from RNA STRAND v2.0 database 
as our dataset D. RNA STRAND v2.0 contains known RNA secondary structures of any type 
and organism, particularly with and without pseudoknots. To the best of our knowledge, RNA 
STRAND v2.0 is the most comprehensive collection of known RNA secondary structures to date 
[52"] . There are 2334 pseudoknot-free RNAs in the RNA STRAND database. We sorted them 
based on their length and selected the first 1720 ones, whose lengths vary between 4 and 123 
nt. We excluded pseudoknotted structures because our current implementation is incapable of 
considering pseudoknots. Some sequences in the dataset allow only A-U base pairs (not a single 
C-G pair), in which case the Newton polytope degenerates into a line. 

4.6 Implementation 

We implemented the dynamic programming in ()24|) using MATLAB convex hull function which is 
based on the quickhull algorithm [33]. As mentioned above, we used the V-representation and com- 
puted the Minkowski sum by direct pairwise summation of vertices. More precisely, for two convex 
polytopes P with vertices p\ , . . . , p a and Q with vertices q\, . . . , g&, the vertices of P © Q are pi + qj 
for 1 < i < a and 1 < j < b. To verify the necessary condition, i.e. whether the experimentally 
determined feature vector lies on the boundary of the Newton polytope, we calculated the distance 
of the feature vector from the boundary of the polytope using the 'p_poly_dist' MATLAB function 
[34] . A zero distance corresponds to the case where the feature vector lies on the boundary, i.e. the 
condition is satisfied, and a positive distance to the case where the feature vector is in the interior 
of the Newton polytope. We normalized the distance by square root of the area of the polytope, 
which is a planar polygon in this case. The normalized distance quantifies how far the feature 
vector is from the boundary. We parallelized our MATLAB code using MATLAB 'parfor'. We 
ran experiments on the JSLQ nodes of the Wayne State Grid in however non-parallel mode due to 
lack of support. The length of input RNA sequences varied between 4 and about 120 nt. For the 
smallest ones, our program took a fraction of a second and for the longest ones it took less than 10 
minutes to run on a 2.4GHz Dual Core AMD Opteron CPU with at most 16 GB RAM. 
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5 Results 



For each strand of RNA, the distance between c(x,y), the real feature vector of the secondary 
structure and the computed convex hull, Af(x), is calculated using [33]. We denote this distance 
by r(x) here. The necessary condition for the learnability is satisfied if r(x) = for all x in the 
dataset, which shows that the observed feature vector lies on the boundary of Af(x). 

Fig. [T] illustrates the secondary structure of h.5.b.E.coli.hlxnum, an example in the dataset, 
and its Newton polygon. This RNA is 120 nt long, and its final Newton polygon has 12 vertices. 
The native feature vector for this RNA has r = which means that it lies on the boundary of the 
polygon. Fig. [2] illustrates another example, Hammerhead Ribozyme (Type I) whose length is 115 
nt. Its Newton polygon has 15 vertices, and the calculated r is 5 for this case. The big hairpin loop 
in this case suggests that the dominant energetic features are not base pairs, and this is consistent 
with the longer distance observed between M{x) and c(x,y). 

Out of 1720 RNA sequences in our experiment, for 490 (28%) the native feature vector is 
exactly on the boundary of the Newton polygon. For the remaining 1230 sequences, the native 
feature vector settles inside their Newton polygon. Fig. [3J demonstrates the histogram of r{x) 
for the input dataset. For 516 (30%) of input sequences, r is non-zero but less than 1, which 
suggests that the dominant energetic features in those cases are canonical base pairs. For 579 
(34%) sequences, r is between 1 and 5. In the remaining 135 cases (8%), r goes larger than 5. The 
second plot in Fig. [3] shows the normalized distance histogram. The square root of polygon area is 
used as the normalization factor. For 712 (41%) strands, this number is no more than 0.05. 

The relation between the number of vertices and the length of RNA strand is shown in the third 
plot of Fig. [3l The maximum number of vertices is 15 which happens for three different strands, 
each 115 nt long. The minimum number of vertices is two; however, it is clear that no polygon 
exists with two vertices. In those cases, the RNA admits only A-U base pairs or only C-G base 
pairs and not both of them. The resulting polygon is just a line, and c(x, y) always lies on the 
boundary of M(x). 

6 Conclusion and Future Work 

We introduced the notion of learnability of the parameters of an energy model as a measure of 
its inherent capability. We derived a necessary condition for the learnability and gave a dynamic 
programming algorithm to assess it. Our algorithm computes the convex hull of the feature vec- 
tors of all feasible structures in the ensemble of a given input sequence. Also, that convex hull 
coincides with the Newton polytope of the partition function as a polynomial in transformed energy 
parameters. 

Our theory applied to a simple energy model that counts A-U and C-G base pairs separately 
revealed that about one third of chosen known structures could potentially be predicted using this 
simple energy model. For another one third, the necessary condition is barely violated, which sug- 
gests that augmenting this energy model with more features is expected to solve the problem. The 
condition is severely violated for 8% of sequences, which will be the subject of future investigation. 
The twilight zone is also interesting and requires deeper examination. 

The Newton polytope lies in the core of computer algebra for solving polynomial equations. 
Therefore, we envision applications of our RNA Newton polytope in symbolic estimation of energy 
parameters. Sufficient conditions for the learnability, and also assessing the generalization power 
of an energy model remain for future work. 



8 




(a) Secondary structure of h.5.b.E.coli.hlxnum. 



Final Polygon for h.5.b.E.co!i.hlxnum 




15 20 
Number of AU Base Pairs 



(b) The Newton polygon of h.5.b.E.coli.hlxnum, with 12 vertices. 



Figure 1: The secondary structure and Newton polygon of h.5.b.E.coli.hlxnum in RNA STRAND 
v2.0 |32j . The native feature vector lies on the boundary in this case. 
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(a) Secondary structure of Hammerhead Ribozyme (Type I). 



Final Polygon lor Hammerhead ribozyme (type I) 




5 10 15 20 25 3D 35 40 45 50 



Number of AU Base Pairs 

(b) The Newton polygon of Hammerhead Ribozyme (Type I), with 15 vertices. 

Figure 2: The secondary structure and Newton polygon of Hammerhead Ribozyme (Type I) in 
RNA STRAND v2.0 |32j . The native feature vector does not lie on the boundary in this case. 
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Distance Histogram 
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0.15 0.2 0.25 0.3 
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Number of Polygon Vertices vs. Strand Length 



Strand Length 



Figure 3: (Top) Histogram of r. (Middle) Histogram of r(x)/ ' \J area(A/"(x)). (Bottom) Scatter plot 
of the number of vertices of the Newton polygon as a function of sequence length. 
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